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A method for gene mapping from chromosome and phenotype data 
Field of the invention 

The present invention relates to a method for gene mapping from chromosome and 
5 phenotype data, which utilizes linkage disequihbrium between genetic markers m U 
which are polymorphic nucleic acid or protein sequences or strings of single- 
nucleotide polymorphisms deriving from a chromosomal region. 

Background of the invention 

Gene mapping aims at discovering a statistical connection from a particular disease 
10 or trait to a narrow region in the genome probably containing a gene that affects the 
trait In particular, the discovery of new disease susceptibility genes can have an 
immense importance for human health care. The gene and the proteins it produces 
can be analyzed to understand the disease causing mechanisms and to design new 
medicines. Further, gene tests on patients can be used to assess individual risks and 
15 for preventive and individually tailored medications. Obviously, gene mapping is 
receiving increasing interest among medical industry. 

Genetic markers along chromosomes provide data that can be used to discover as- 
sociations between patient phenotypes (e.g., diseased vs. healthy) and chromosomal 
regions (i.e., potential disease gene loci). The growing number of available genetic 
20 markers, anticipated to reach hundreds of thousands in the next few years, offers 
new opportunities but also amplifies the computational complexity of the task. 

Human genome sequencing efforts, the first ones now almost complete, read the full 
human DNA sequence. There are methods for recognizing where there are genes m 
the sequence — the number of which is currently estimated to be around 30,000. 
25 However, we lack methods for deriving the function of a gene from the sequence in- 
formation. Gene mapping approaches this problem for one disease at a time. It aims 
at discovering areas in the genome - hopefully small - that have a statistical con- 
nection to a given trait, thus narrowing down the area to be analyzed with expensive 
laboratory methods. 

30 A typical setting for gene mapping is a case-control studytfiorne" chromosome of 
diseased and healthy individuals. Instead of looking at the DNA of the whole chro- 
mosome, only certain marker segments distributed along the chromosome are con- 
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sidered By the analysis of similarities within the disease-associated chromosomes 
on one hand and the differences between the disease-associated and control chro- 
mosomes on the other, one can try to locate likely areas for a gene mat predisposes 
people to the disease analyzed. 
5 The overall goal of the method according to the invention is to locate a disease- 
' susceptibility gene for a given disease. In gene mapping, the aim is to identify a nar- 
T cCmoIomal region within which the gene is likely to be; this area can then be 
analyzed in more detail with laboratory tools. We next briefly review tile genetic 
background; without toss of generality, we restrict the discussion in this paper to 
10 one chromosome. 

Marker data 

A genetic marker is a short polymorphic region in the DNA, denoted here by Ml 
M2 The different variants of DNA that different people have at the marker are 

called 'alleles, denoted in our examples by 1, 2, 3 The number of alleles per 

marker is small', typically less than ten for microsatellite markers, and exactly two 
for single nucleotide polymorphisms (SNP). The collection of markers used in a 
particular study is its marker map, and the corresponding alleles in a given chromo- 
some constitute its haplotype (Figure 1). It is a major task of a gene mapping study 
to design the marker map and to obtain the haplotype data. That is where we start 
and for the purposes of this paper the input data consists of haplotypes of diseased 
and control persons - or, in computer science terms, aligned allele stnngs, classified 
to positive and negative examples. 

T inka pte dise q uilibrium 

All the current carriers of a disease-susceptibility gene have inherited it from a 
25 founder who introduced the gene mutation to the population. If there has been only 
one or few such founders, then many of the current carriers are related, may share 
some segments of the chromosome, and lend themselves to gene mapping studies. 
In particular, segments from the mutation carrying founder chromosomes are over- 
represented among the affected at mutation locus. Relatively young (e.g. 1000 
30 years) population isolates are promising sources of data in this respect: disease- 

susceptibility genes may have been introduced by one or two founders only, and the 
gene may be over-represented in the population. Kainuu region m eastern Finland is 
an example of such a fruitful area for genetic studies. 
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If there are conserved regions at the mutation locus, then it can be possible to ob- 
serve linkage disequilibrium (LD), or non-random association between nearby 
markers (Figure 2). There are severe statistical problems, however, in observing 
LD Mutation carriers often only have a higher risk of being diseased than non- 
5 carriers, and in a case-control study both groups can be mixes of carriers and non- 
carriers Further on, since the selection of patients is more or less random, and the 
whole coalescence process leading to LD is stochastic, it is a challenge to recognize 
LD and the DS gene location from all the noise. 

de.rxe ma pping 

10 In diseases with a reasonable genetic contribution, and especially in population iso- 
lates affected individuals are likely to have higher frequencies of certain alleles and 
haplotype patterns near the DS gene than control individuals. This is the starting 
point of LD-based mapping methods: where does the set of affected chromosomes 
show linkage disequiHbrium? The problem is far from trivial, however. The coales- 

15 cence process is stochastic; mutation carriers often only have a higher risk of being 
diseased than non-carriers, and in a case-control study both groups are usually 
mixes of carriers and non-carriers; and finally, there is missing information and hap- 
lotyping ambiguities. 

Most current gene mapping methods based on linkage disequihbrium look just at 
20 individual markers or neighboring markers, measure their association to the disease 
status and predict the gene locus to be co-located with the strongest association. 
However, since different mutation carriers share different segments, there is no sin- 
gle marker or pattern that is representative of the shared segments. 

In the recent years, several statistical methods have been proposed to detect LD 
25 (Terwilliger 1995, Devlin et al. 1996, Lazzeroni 1998, Service et al. 1999, McPeek 
et al 1999). The emphasis has been on fairly involved statistical models of LD 
around a DS gene. They model whole recombination histories and some are robust 
to high levels of heterogeneity. On the other hand, the models are based on a num- 
ber of assumptions about the inheritance model of the disease and the structure of 
30 the population, which may be misleading for the statistical inference. The methods 
tend to be computationally heavy and therefore better suited for fine mappmg than 
genome screening. 

Haplotype Pattern Mining or HPM (Toivonen et al. 2000a, Toivonen et al. 2000b) is 
based on analyzing the LD of sets of haplotype patterns, essentially strings with 
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wUdcaid characters. The method first finds all haplotype patterns that are strongly 
"ted widt the disease status, nsing ideas similax to die discovery of assocuhon 
^es (Agrawal et ah 1993, Agrawal et al. 1996). Since the patterns may contain 
Sp Z can accountfor some missing and erroneous data. In the second step, each 
mLer is ranked by the number of patterns that contain it. Either mis scorn is used 
as a basis for the prediction or, preferably, a permutation test is used to obtem 
marker-wise p values. HPM has been extended for detecting mulhple genes simul- 
taneously (Toivonen et al. 2000b) and to handle quantitative phenotypes and co- 
variates (Severn et al. 2001). 

Nakaya et al (Nakaya et al. 2000) investigate the effect of multiple separate mark- 
ers, each one thought to correspond to one gene, on quantitative phenotypes. Then- 
work is a generalization of the LOD score to multiple loci, and it does not handle 
haplotype patterns. 

An alternative approach for LD-based mapping is linkage analysis. The idea is to 
Ijyze family trees, and to find out which markers tend to be inhented to o fsprmg 
in conjunction with the disease. Linkage analysis does not rely or .common foun- 
ders, so in that respect it is more widely applicable than LD-based mefliods The 
downside is that estimates are rough (due to the smaller effective number of meiosis 
sampled), and that collecting information from larger families is more difficult and 
20 expensive. 

Transmissionydisequihbrium tests (TDT) (Spielman et al. 1993) are an established 
way of testing association and linkage in a sample where linkage disequilibrium ex- 
ists between the mutation locus and nearby marker loci. TDT detects deviations be- 
tween observed and expected counts for each allele transmitted from heterozygous 
25 parents to affected offspring. 

Single permutation tests have been used in mapping studies before (Churchill and 
Doerge 1994, Laitinen et al. 1997, Long and Langley 1999). However, rf more 
complex data is to be analyzed, these single permutation tests are too expensive and 
computationally very ineffective and even inoperative. 

30 Genetic markers provide an economical, sparse view of chromosomes. Even 

sparsely located markers can be very informative: given an ancestor with a disease 
gene, the descendants that inherit the gene are also likely to inherit a string of alleles 
of nearby markers. The exact probability of inheriting any combinahon of markers 
depends on the gene location with respect to the markers, the population history or 
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the coalescence history, and marker mutations; all of these are unknown. There is a 
continuous need for more effective gene mapping methods. 

The object of the present invention is to provide a novel method for gene mapping 
from chromosome and pheno type dam. The method according to the invenhon .coor 
5 aiders toe recombination histories - sort of f aunly trees - that are bkely to have 
cled the observed trees of patterns. The disease susceptibility CDS gene i ti«n 
predicted to be where the strongest genetic contribution is visible m the trees. The 
contributions of the method according to the invention are: 

(1) a novel approach to gene mapping using tree patterns, 
10 (2) an efficient algorithm for generating and testing tree patterns, 

(3) a method for the estimation of statistical significance of individual findings as 
well " the whole process, based on multiple permutations but earned out at the cost 
of a single permutation. 
Summary of the invention 
15 It is an object of the present invention to provide a method for gene mapping from 
ch^osome and phenotype data, which utilizes linkage disequilibrium between 
^"markers J- which are polymorphic nucleic acid or protein sequences or 
Sags Tf single^cleotide polymorphisms deriving from a chromosomal region. 
The method of the invention comprises steps of 
20 i) identifying a prefix tree T based on the observed haplotypes at a num- 

ber of locations of a chromosome, 

SB evaluating each prefix tree T by its genetic and statistical feasibility, as- 
suming that me gene was close to me root of the tree, and thus deter- 
mining a score for each prefix tree T, 
25 iii) predicting the area for the location of the gene as a function of the score 

determined in the step (ii). 
The present invention is now explained in detail by referring to the attached figure* 
examples. These examples are only used to show some of the embodiments and 
are not intended to limit the scope of the invention. 
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Brief description of the drawings 

Figure 1. A marker map of ten markers and a sample haplotype consisting of aHeles 
in adjacent markers. 

Fisnre 2 A carrier of the ancestral mutation has inherited founder alleles around 
STete loT These alleles are similar to those of me ancestral chromosome m 
0 Due to me common inherited segment, many of the contemporary mu- 

SS. are expected to share alleles in the markers around the mutation, but 

the length of the snared haplotype vanes. 

Figure 3. A possible coalescence tree at me fourth marker for the three observed 

An alternative coalescence tree would have -344- instead ot I A* 
level. 

Figure 4. An illustration of the tree sttucture in a string-sorted set of haplotypes to 
the right from the location pointed by the arrow. 

5 Figures. Analysis of the performance of TreeDT. A: Gene loc ah ^onpow^ 
wtth different values of A, the proportion of disease-associated chromosomes that 
« ^ the mutation. B: Gene localization power with different numbers of 
Sees Cmlod parameter) and different numbers of founders (popuhttion parame- 
5 C Gasification accuracy for me existence of a disease susceptibly gene. 

,0 Figure 6. Comparison of the gene localization performance of TreeDT HPM u mul- 
SL TDT (m-TDT), and TDT. A: The baseline test setting, B: The baseline set- 
ting with three founders. C: The baseline setting with 5% missing data. 

Detailed description of the invention 

It is an object of the present invention to provide a method for gene mapping aiming 
25 to discover a gene region affecting a certain trait using chromosome data. 

Empirical evaluation on arealistic, simulated data shows that the method ^cording 
to the invention is competitive with other recent data mining based methods, and 
clearly outperforms more traditional methods. Our experiments, explained later, 
sh ow that the method according to the invention, TreeDT, is effective m extteme 
30 conditions typical for current mapping problems: with lots of noise (only 10-20% of 
aLcted chromosomes carry the mutation, lots of missing data) and witii small s^- 
ple sizes (200 affected and 200 control chromosomes). However, the highest poten- 
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Hal of the method according to the invention lies in the data intensive tasks of future 
- such as genome scanning with larger samples and larger number of markers - due 
to its low computational complexity. 

In comparison to state of the art methods, TreeDT is most competitive. In terms of 
5 gene localization accuracy, it gave best results in the case of multiple founders and 
demonstrated good robustness with respect to missing data. Unlike the compared 
methods TreeDT can be used to predict whether a gene is present at all or not. Fi- 
nally, in'comparison to its closest competitor, HPM, TreeDT has much smaller 
computational cost. An additional advantage of TreeDT is that it has only one input 
10 parameter, the (maximum) number of deviant subtrees, whereas for HPM one has to 
set several more or less arbitrary thresholds. 

Method 

For any pair of chromosomes in the sample there has been a common origin in the 
population history, an ancestral chromosome at which their paths have diverged 

15 Due to recombinations different parts of chromosomes have different histories. At 
any given location the chromosomes in the sample and their most recent common 
origins form a coalescence tree. In the coalescence tree for the DS gene location, all 
the chromosomes in one or more subtrees carry the DS mutation, and we shouM ob- 
serve excess of disease-associated haplotypes as the leaves of these subtrees. The 

20 closer the tree is located to the DS gene, the more and larger subtrees are identical 
to those in the tree at the DS gene location. 

Based on the observed haplotypes, the method of the invention defines a prefix tree 
estimating the most likely coalescence tree at a number of locations along the ana- 
lyzed chromosome, and then assesses the subtree clustering of disease-associated 
25 haplotypes in these trees. 

It is a further object of this invention to provide a novel tree disequilibrium test, in- 
tended for predicting DS gene locations in the method of the invention. The vicinity 
of the location for which the test gives the lowest p value is the most likely candi- 
date area for the DS gene location. The method also calculates the corrected overall 
p value for the best finding. This p value can be used for predicting whether the 
chromosome carries a DS gene. 

Further a method for the estimation of statistical significance of individual findings 
as well as the whole process, based on multiple permutations but earned out at the 
cost of a single permutation, is provided. 
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Haplotype prep: trees 

T^subsumption relation of the JX^S"' 
rected acyclic graph (DAG). The tree struck <***°***£^ ta Kgure 
n^consideredaspossi^^^ 

we coalescence hee e*34^ g rf ^ chromosomes 

see that trie oraer given . nnf i p( 5 mav represent a 

types may also share a substring by chance "^^^Lalescence 

rrltolr^ ^^".aigenunrberofcoalescencenodesiscon- 

short, and thereiore n is> vciy j voun^er coalescence nodes 

tailed in the empty subshring root On the oth* hand 

* shared regions spanning over several markers are more hkely 

one correspondence with observed recurrent substrings. 

Tree disequilibrium test 

According to "'^^^"^^^^^^'^e^j^re h^o&esit ^TheX^ributionc^ 

r^mly dtaoW m ft. tan*. tfT. TreeDTidentt &emll 
the observed status distribution deviates most from the expectation n 
h^esis, and returns the significance of me deviation as a p ^es 

Mhle ^ maiority of the mutation earners is concentrated in a omy i 

°^ to " m/shared region is long enough to identify a deviant substring. In 

the experiments for this paper we use an upper limit of 6 subtrees. 
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For measuring the disequihbrium of a tree, we use a variant of the Z test. The test 
statistic Z k for a tree with k deviant subtrees T h ...,T k is 
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where a, is the number of disease-associated haplotypes and n { the total number of 
haplotypes in subtree T, e S, and/? is the proportion of disease-associated haplo- 
types in the sample. The score measures the distance of the observed number of dis- 
ease-associated chromosomes from the expectation (n iP ) in standard deviations 
(the square root of n iP (l-p)), under the assumption of binomial distribution with pa- 
rameters m and p. We use a one-tailed test, since we are interested only m subtrees 
in which the proportion of disease-associated haplotypes is greater than expected. 

We could use a 2x(*+l) ^-statistic as a measure of deviation for a given subtree set 
S. The ^-statistic, however, is not easily maximized in the space of all possible sub- 
tree sets and is therefore not a very practical choice. 

Z k can be efficiently maximized simultaneously for all k using a recursive algo- 

15 rithm, as shown in the Algorithms section. 

TreeDT takes the maximum number of deviant subtrees as a parameter. In principle 
there is no need to set an upper limit for the subtree count, but whenever LD- 
mapping is applicable, the majority of the mutation carriers is concentrated in only 
few such subtrees in which the shared region is long enough to identify a deviant 

20 substring. In the experiments for this paper we use an upper limit of 6 subtrees. 

Significance tests with nested permutations 

Zk is a measure for the disequihbrium of a given tree, corresponding to a certain lo- 
cation in the chromosome, with given k deviant subtrees. Given a ^, TreeDT 
finds for each k the set S of subtrees that maximizes Z k . In order to find the best k 
25 for the given tree, simple maximization is not possible. Since the statistics or ef- 
ferent degrees of freedom ft are not comparable, TreeDT estimates the p value for 
each maximized Z k (under the null hypothesis of random distribution of disease 
status). Because the distribution of the maximized Z k is very complex and depend- 
ent on the tree structure, p values are estimated by a permutation test. 
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In order to get a single p value for fire diseqnUibrinm at a 

combine the information from the trees to the left and to the right of me location. As 
a combined measure we use the prodnct of the lowest p valne over aB k from each 
side. Again, since the measures are not necessarily direetly comparable, a new p 
5 value for the combination is estimated. The results are now comparable between 
different locations. 

The output of TreeDT is essentially the p value ranked list of locations. A point 
prediction for the gene location is obtained by taking the best location; a ^ntmUy 
fragmented) region of length I is obtained by taking best locations until a length of I 
10 is covered. 

Since multiple locations are tested for a p value, and also since the p values at 
nearby locations are not independent, a direct link between the p value ,«nd the 
probability mat the gene is indeed close to the location can not be established. The p 
values are used simply as a method of ranking the locations. 

15 However, a single corrected p value for the best finding can be obtained witi, a third 
test using the lowest local p value as the test statistic. This p valne can also be used 
to answer the question whether there is a gene in the investigated are m the first 
place or not. 

All these three nested p value tests (for each tree and *, for each location, for the 
20 best location) can be carried out efficiently at the cost of a single test Table 1 sum- 
marizes the three levels of the nested test. 



Table 1. A summary of the permutation test procedure. 



Leve 
1 


For 
each 


Test statistic 


Result 


1 


(T,k) 


max Z k (S,T) over all 
S 6 SubtreeSetsiT) 


p(T,k) 


2 


t 


rmnp(Ti,k{)p(TM 
over all k\, k 2 


p(t), the p value of the tree disequilibrium 

test for the pair 
t = (T U T 2 ) of left- and right-side trees rooted 
at the same location 


3 




min p(t) over all t 


p. the corrected overall p value 
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Al gorithms 

Constructing the haplotype prefix-trees 

The haplotype prefix-trees to the left and right from each analyzed location can be 
efficiently identified using a string-sorting algorithm. The algorithm produces as m- 

5 termediate results for each marker the sorted list of the partial haplotypes to the 
right from the marker. All the right-side trees can be easily derived from these in- 
termediate lists, because the haplotypes belonging to a single node form a continu- 
ous block in the sorted list. The left-side trees can be identified similarly by sorting 
the inverted haplotypes. The computational cost of constructing the trees is neghgi- 

10 ble compared to the cost of the permutation test procedure. 

The same process can also be used to enumerate all the recurrent substrings, or all 
the closed substrings. A substring s is closed, if and only if none of its superstrmgs 
match all the same haplotypes than s. The nodes in the right-side prefix trees have 
one-to-one correspondence to recurrent substrings starting at the same marker 
15 Nodes that are to be split in the next step of the sort algorithm correspond to closed 
substrings. 

An algorithm for maximizing the tree disequilibrium statistic 

It is essential that the time-complexity of the algorithm for maximizing me Z-values 
is as low as possible, because it must be executed for each tree location and pennu- 
20 tation in turn. The key observation is that if S is the set of * deviant subtrees of J 
with the greatest value of Z h T is a subtree of T, and 5' c S is a set of m subtrees 
in 7", then S> has the maximum value of in T". Also, if S = S, U. . .u S n , and k is 
the subtree count in 5, and h is the subtree count in S b then 

Z t (S) = 2X (>?,.). 

i 

These observations lead us to the following recursive algorithm that propagates the 
25 locally maximized Z-values upwards in the tree: 



Input: A haplotype prefix tree T 

Output: Maximum values of Z k in the tree T for each k 

Call Maximize(T) 
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Maximize(T): 
If T is not a leaf: 

1. For each immediate subtree T t of T : Recursively call Maximize^). 

2 For each fc calculate the maximum value Zmax, *0) for Z fe (S,D over all S that 
i can be obtained by combining subtree sets from each subtree T, of T. 

3. Calculate Z x for T. If Zi > Zmax. iCO then set Zmax, iCD : = Zi- 

If T is a leaf, then set Zmax, iCD *• = °- 



Step 2 can be further refined: 
10 2.1 Set 7, : = 0 and ZsIA x. ^ : = 0 for all 1 < * < n, where > is the number of 
leaves in T. 

2.2 For each subtree T of T: 

2 2.1 For each pair (ij), 1 < i < P and 1 < J < * where p is the number of leaves in 
T and q is the total number of leaves in all the subtrees processed prior to T . 

15 If Zmax, iCO + Yj > Zmax, then set Z max, h*W = z max. + ^ 

2.2.2 For each k, 1 < k <p: 

If Zmax. *<T') > Zmax, *CD> then set Zmax, k(T) : = Zmax, *<T). 

2.2.3 For each k, 1 < k<p+q: 

If Zmax, *CD > ^C 7 )' then set Ffc(7 ^ : = Zmax> ^ 
20 The time complexity of the algorithm is 0(n 2 ) (proof omitted), where n is the num- 
ber of leaves in the tree i.e. the number of haplotypes in the data set. By setting an 
upper limit k for the size of the subtree sets, the average time complexity can be re- 
duced to 0(n) with a constant coefficient proportional to k 2 , k being typically small, 
<10. 
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An efficient algorithm for multiple nested permutation tests 

The straightforward algorithm for a three-level nested permutation test using nested 
loops would have time complexity of 0(r?qr\ where n is the number of Permuta- 
tions at each level, q is the time complexity of maximizing the Z*-staustic for all A:, 
5 and r is the number of tested locations in the chromosome. The test would be intrac- 
table already with rather small permutation counts. However, the tune complexity 
can be drastically reduced using the same set of permutations at each level of the 
test and thus only maximizing the Rvalues n instead of n 3 times for each location. 

1. Compute Zmax, kW = max Z k {T,S) for each subtree count k and each coalescence 
10 tree T over all S e SubtreeSets(T). 

2. Randomly generate n+1 permutations of disease-association statuses for the hap- 
lotypes and for each permutation i and (T,k): compute Zmax, k&T) - max 
Z k {i,T,S) over all S e SubtreeSets(T). 

II Level 1 
15 3. For each (T,k): 

3.1 Calculate a p value p(T,k) by comparing Zmax, fcCD to Zmax, &T), 1< i ^ «• 

3.2 For each permutation i: calculate ap value p(i,T,k) by comparing Zmax, kH,T) 
to all Zmax, kti.T) >J 

II Level 2 

20 4. For each pair of opposed trees rooted at the same location t = (JiJi)' 

4.1 Choose /WO = m&p(T h kMT 2 ,k 2 ) over all ft lf * 2 

4.2 For each permutation i: choose pvMti = rnm ^7^) p(j,T 2 ,fc 2 ) over all * lf 

4.3 Calculate a p value p(0 by comparing Pmin(0 to pminG. 0 , 1£ 1 £ n. 

25 4.4 For each permutation i: calculate ap value p(i,0 by comparing PmjnC^) to all 
PumWJ^i- 
//Level 3 



5. Choose pmin = min p(0 over all t. 
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6. For each permutation i: choose pyaff) = minp(t') over all t. " 

7. Calculate the overall corrected? value by comparing pmm to PMN (i) , IS i £ ». 

The time complexity of steps 3.2 and 4.4 is Oft. log n) using an 
ftat sorts die values of the test statistic for all the permutations. Step 2 predominates 
2 ti^fdplexity of die algorithm, 0(n<r), where , is die upper hunt for the 
STZ- *- - a set, , is the time complexity of —I the Z k - 
statistic for all and r is die number of tested locations m the chromosome. 

Due to the finite number of permutations, the precision of the p values given by a 
p^onTests may not be sufficient for accurate localization. In some — 
even a very large number of permutations does not produce any values for the test 
ZZc more extreme flian die observed values for several ^consecutive 
tions. For this purpose .hep values remrnedby the first and second 
tiontests are determined shghtiy unconventionally: At level, 
modified version of algorithm 2 to obtain an upper bound of Z t for all k. At level 2 
Z smallest possible value for the test statistic is zero. These values correspond to 
p vies of l/2(n + «. The rehimed p value is interpolated between thep vahies cor- 
relponding to die next lower and higher values for die test statistic obtained by per- 
mutations The top-level test returning the overall? value is implemented in the 
usual conservative manner. 



20 
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30 



Examples 

Certain embodiments and results of the present invention are described in the fol- 
lowing non-limiting examples. 

We compare TreeDT empirically to TOT, an established mapping method and to 
MuZ decent proposal based on pattern discovery. We evaluate the medrods on 
T^Z data coLtion carefully simulated to resemble a realistic population iso- 

late. 

Example 1 - Simulation of Data 

We designed several different test settings, with variation in the fraction (A) of mu- 
tation carriers in the disease-associated chromosomes, in the number of founders 
wh o introduced the mutation to the population, and in the amount of missing infor- 
mation. For statistical analyses, we created 100 independent artificial data seta m 
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each test setting. Great care was taken to generate realistic data by a simulation pro- 
cedure that included four steps: pedigree generation, simulation of inheritance, di- 
agnosing, and sampling. 

The population pedigree was set to grow from 100 to 100,000 individuals in a pe- 
5 riod of 20 generations. In each generation, the selection of parents for each child 
was random, but once a couple was formed, all subsequent children allocated to ei- 
ther of the parents were set to be common children of the couple. 

The inheritance of chromosomes within the population pedigree was simulated by 
first allocating a continuous chromosomal segment of 100 centiMorgans to each 
10 founder individual in generation 1 . 

Morgan is a unit of genetic length. 1 cM is the distance at which recombination oc- 
curs 1 out of every 100 times, about 10 6 base pairs. Human chromosomes are 
roughly of 50-300 cM. 

Next, the entire pedigree was traversed top-down, and, in each inheritance event, 
15 gametes were created by simulating meiosis under the assumption that the number 
" of chiasmata in the pair of homologous chromosomes was taken from Poisson dis- 
tribution with parameter one (corresponding to the genetic length of 100 cM), and 
their locations selected randomly. A related approach was originally presented m 
(Terwilliger et al., 1993). 

20 For a baseline test setting we selected a challenging disease model where only a 
small proportion (A=10%) of the disease-associated chromosomes carries the dis- 
ease-predisposing mutation, a complication that often is encountered in the analysis 
of common diseases. In the baseline setting there is one founder, and on average 
3.7% of alleles are missing, making the mapping task more difficult but also more 

25 realistic. 

The location of the mutation was selected randomly and independently for each of 
the 100 data sets produced in every setting. Each data set was in turn collected from 
100 affected individuals. The length of the region to be analyzed was 100 cM. Alle- 
lic data were created using a map of 101 equidistantly spaced markers, each having 
30 5 alleles. Both chromosomes of each affected individual in each sample were la- 
beled disease-associated whereas the control chromosomes were constructed from 
the non-transmitted alleles in the parental chromosomes. Each data set thus con- 
sisted of 200 disease-associated and 200 control chromosomes 
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Example 2 - Analysis of TreeDT 

First we assess the prediction accuracy of TreeDT with different values of A, the 
^portion of disease-associated chromosomes that actually carry me mutation 

The results are reported as curves that show the p-« £10M-a 
5 ets where the gene is within the predicted region, as a function of the length of the 
SdTcted region. Or, in other words, the x coordinate tells the cost . geneficist is 
willing to pay, in terms of the length of the region to be further analyzed, and the y 
ToordLte gives the probability that the gene is within the regton. For A=20% or 
5% the accuracy is very good, and with lower values of A me accuracy 
,0 until with A=5% only in 20-30% of data sets can the gene be localized withm a rea 
solTe accuracy of 10-20 cM. We remind the reader that the test setbngs have been 
designed to be challenging, and to test the limits of the approach. 
Next we evalnate the effect of the only parameter of TreeDT, me number of deviant 
"notrees mat are searched for in each tree. An upper limit of 6 subtrees, used in the 
15 previous test, is evaluated against fixed amounts of 1, 2, or 3 subtrees, m th awy- 
L number of founders that introduced the mutation (Figure 5B). As we increase 
the number of founders, evidence about the gene location becomes more frag- 
ment and accordingly the performs degrades. While the differences between 
different numbers of subtrees are not large, it is interesting to note that for each 
20 number of founders, the same number of subtrees gives nwginaUy the bes result 
The upper limit of 6 subtrees gives consisted, competitive results, so we continue 
using it in the following experiments. 

Gene mapping studies like the ones imitated in the above tests assume, based on 
some other analyses, that a disease susceptibility gene is indeed present m the ana- 
25 lyzed area. TreeDT has the important advantage over plain gene locahzanon meth- 
ods that it can also be used to predict whether the analyzed region contains a disease 
susceptibility gene at all or not. The overall, value TreeDT produces 
corrected significance of the best single finding, and by settmg an upper limit for its 
value TreeDT can be used to classify data sets to ones that do or do not contain a 
30 gene. For data sets with no gene, TreeDT cotrectly produces overall? values Art 
are uniformly distributed in [0,1]. So, smaller thresholds for p result in less false 
positives, but also in less true positives. Figure 5C shows the experimental relation- 
ships between power (ratio true positivites/all positives) and overall p (ratio false 
positives/all negatives). For higher values of A the classification accuracy is ex- 
35 tremely good. For A=5% it is comparable to random guessing, although TreeD 1 is 
still able to locate an existing gene adequately in 20-30% of the cases (Figure 5A). 
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Example 3 - Comparison to other methods 

TreeDT HPM, and m-TDT have practically identical performance in localizing the 
DS gene in the baseline setting (Figure 6A). TDT is clearly inferior compared to the 
other methods. Tests with other values of A give similar results. 

5 In a test setting with three founders who introduced the mutation to the population, 
differences between the three best methods start to appear (Figure 6B). TreeDT has 
an edge over HPM, which in turn has an edge over m-TDT. TDT barely beats ran- 
dom guessing. 

Finally we compare the methods with a large amount of missing data (Figure 6C). 
10 Expectedly, HPM is most robust with respect to missing data since it allows gaps m 
its haplotype patterns. Surprisingly, TreeDT is not much weaker than HPM, al- 
though no actions have been taken in it to account for missing or erroneus data. Per- 
formance of m-TDT degrades much more clearly. 

Method to method comparisons (not shown) indicate that the prediction errors are 
15 mostly caused by random effects in population history - since different methods 
tend to make mistakes in the same data sets - rather than by systematic differences 
between the methods. However, those cases where one method succeeds and an- 
other fails will give useful input for further improvements of the methods. 

The execution time of TreeDT for a single dataset is about ten minutes using 1,000 
20 permutations on a 450 MHz Pentium II. The respective time for HPM with permu- 
tations is over 20 minutes. 
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Claims 

1 A method for gene mapping from chromosome and phenotype data, which util- 
izes linkage disequilibrium between genetic markers m h which are polymorphic nu- 
cleic acid or protein sequences or strings of single-nucleotide polymorphisms denv- 
5 ing from a chromosomal region, characterized in that it comprises following steps: 

i) identifying a prefix tree T based on the observed haplotypes at a num- 
ber of locations of a chromosome, 

ii) evaluating each prefix tree Tby its genetic and statistical feasibility, as- 
suming that the gene was close to the root of the tree, and thus deter- 

!0 mining a score for each prefix tree T, 

iii) predicting the area for the location of the gene as a function of the score 
determined in the step (ii). 

2. The method according to claim 1, characterized in that in the step (i) the prefix 
tree T is build between each pair of consecutive markers. 
15 3. The method according to claim 1 or 2, characterized in that the prefix tree T is 
build using a string-sorting algorithm. 

4 A method according to claim 1, characterized in that the prefix tree T is evalu- 
ated by tree disequihbrium test testing the alternative hypothesis The distribution of 
the disease-association statuses deviates in some subtrees of T from the overall dis- 
20 tribution of statuses against the null hypothesis The disease-association statuses are 
randomly distributed in the leaves ofT. 

5. A method according to claim 4, characterized in that for measuring the disequi- 
librium of a tree a test statistic Z k for a tree with k deviant subtrees T h T k is cal- 
culated by the following formula: 

* a , - n , p 

25 z k = 



where a t is the number of disease-associated haplotypes and m the total number of 
haplotypes in subtree T t eS,S being the given subtree set, andp is the proportion of 
disease-associated haplotypes in the sample. 
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6. A method according to ciaim 4 or 5, characterized in that the following algo- 
rithm is used: 

Input: A haplotype prefix tree T 
Output: Maximum values of Z k in the tree Tfor each k 
5 Call Maximize(T) 
MaximizeiT): 
If T is not a leaf: 

1 . For each immediate, subtree T t of T : Recursively call Maximized ■ 

2. For each t calculate the maximum value Zmax, k(X) for Z k (S,T) over all S that 
10 can be obtained by combining subtree sets from each subtree T t of T. 

3 . Calculate Zj for T. If Z x > Zmax, iCO ^en set Zmax, iCO '- = Zi- 
If T is a leaf, then set Zmax, iCD : = 0. 

7. A method according to claim 6, characterized in that step 2 is further refined as 
follows: 

15 2.3 Set Y k : = 0 and Zmax, *CD : = 0 for all k, 1 < * < n, where n is the number of 
leaves in T. 

2.4 For each subtree T of T : 

2.4.1 For each pair (ij), 1 < i <p and 1 <j < q, where p is the number of leaves in 
T and q is the total number of leaves in all the subtrees processed prior to T: 

20 If Zmax, ,<n + Ij > Zmax, then set Zmax, : = Zmax, + Yj. 

2.4.2 For each k,l<k<p: 

If Zmax, *(?") > Zmax, *CD, then set Zmax, • = -Zmax, *CT). 

2.4.3 For each 1 ^ & < p+#: 

If Zmax, AT) > Y k {T), then set Y k (T) : = Zmax. k(T) 
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8. A method according to any of claims 4 to 7, characterized in that the signifi- 
cance of the disequiUbrium at a given location is tested by multiple nested permuta- 
tion test. 

9. A method according to claim 8, characterized in that the permutation test com- 
5 prises following steps: 

- finding for each k the set S of subtrees that maximizes Z k and estimating the 
p value for each maximized Z k 

- estimating a new p value for a combination of the information from the pre- 
fix tree Tto the left and to the right of the location, combined measure being 

10 the product of the lowest p value over all k, and ranking locations by the new 

p values, 

- obtaining the point prediction for the gene location by taking the best loca- 
tion from the p value ranked list of locations and obtaining a single corrected 
p value for the best finding with a test using the lowest local/? value as the 

15 test statistic. 

10. A method according to claim 9, characterized in that the following algorithm 
is used: 

1 . Compute Zmax, kiD = max Z k (T,S) for each subtree count k and each coa- 
lescence tree rover all S £ Subtrees ets(T). 

20 2. Randomly generate n+1 permutations of disease-association statuses for 

the haplotypes and for each permutation i and (T,k): compute Zmax, kihT) 
= max Z k (i,T,S) over all 5" £ SubtreeSets(T). 

II Level 1 

3. For each (T,k): 

25 3.1 Calculate a p value p(T,k) by comparing Z M ax, k(T) to Zmax, ktt.T), 1< i 

3 .2 For each permutation v. calculate a p value p(j, T, k) by comparing 
Zmax, ktt, T) to all Zmax, k(j,T),j * i. 



II Level 2 
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4. For each pair of opposed trees rooted at the same location t = (Ti,T 2 ): 

4. 1 Choose PmM = min piTMpiTM over all k x , k 2 

4.2 For each permutation i: choose pym&t) = nun P(i Ti,k{) P(h Tz,k 2 ) over 
all fei, k 2 . 

5 4.3 Calculate a p value p(t) by comparing Pmin(0 to j>min<£ 0 , 1^ * ^ n. 

4.4 For each permutation i: calculate a p value pii, t) by comparing 
Pmin(U) to allpMiN(/^)» J * 

//Level 3 

5. Choose Pmin = min pit) over all ?. 

10 6. For each permutation i: choose pwsw(.i) = min p(i, t) over all 

7. Calculate the overall corrected p value by comparing ^min to PminCO > 1^ » 
<n. 

11. A computer-readable data storage medium having computer-executable pro- 
gram code stored thereon operative to perform a method of any of preceding claims 

15 when executed on a computer. 

12. A computer system programmed to perform the method of any of claims 1-10. 
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A. Influence of A 
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B. Influence of the number of subtrees 
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A. A=10% 
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Box I Observations where certain claims were found unsearchable (Continuation of hem 1 of first sheet) 

This international search report has not been established in respect of certain claims under Article 17(2)(a) for the following reasons: 

1. ClaimsNos.: 11 , 12 

because they relate to subject matter not required to be searched by this Authority, namely: 

See Art. 17.2a and Rule 39.1.: Presentation of Information and 
computer programs claims 11 and 12 have not been searched 

2. (3 ClaimsNos.: 3 

because they relate to parts of the international application that do not comply with the prescribed requirements to such 
an extent that no meaningful international search can be carried out, specifically: 

see next sheet 



3. |~] ClaimsNos.: 

because they are dependent claims and are not drafted in accordance with the second and third sentences of Rule 6.4(a). 



Box II Observations where unity of invention is lacking (Continuation of item 2 of first sheet) 



This International Searching Authority found multiple inventions in this international application, as follows: 



1. As all required additional search fees were timely paid by the applicant, this international search report covers all 
searchable claims. 

2. Q As all searchable claims could be searched without effort justifying an additional fee, this Authority did not invite payment 

of any additional fee. 

3. Q As only some of the required additional search fees were timely paid by the applicant, this international search report 

covers only those claims for which fees were paid, specifically claims Nos.: 



4. Q No required additional search fees were timely paid by the applicant Consequently, this international search report is 
restricted to the invention first mentioned in the claims; it is covered by claims Nos.: 



□ 
□ 



Remark on Protest 



The additional search fees were accompanied by the applicant's protest. 
No protest accompanied the payment of additional search fees. 
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The term "string- sorting algorithm" is vague and unclear and leave 
the reader in doubt as to the meaning of the technical feature to 
which it refers, thereby rendering the definition of the subject- 
mather of claim 3 unclear (Article 6 PCT) . The term "string- 
sorting" is not defined in the description, thus claim 3 has been 
partially searched. 
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